Equitable machine learning counteracts ancestral bias in precision medicine, improving outcomes for all

Gold standard genomic datasets severely under-represent non-European populations, leading to inequities and a limited understanding of human disease [1–8]. Therapeutics and outcomes remain hidden because we lack insights that we could gain from analyzing ancestry-unbiased genomic data. To address this significant gap, we present PhyloFrame, the first-ever machine learning method for equitable genomic precision medicine. PhyloFrame corrects for ancestral bias by integrating big data tissue-specific functional interaction networks, global population variation data, and disease-relevant transcriptomic data. Application of PhyloFrame to breast, thyroid, and uterine cancers shows marked improvements in predictive power across all ancestries, less model overfitting, and a higher likelihood of identifying known cancer-related genes. The ability to provide accurate predictions for underrepresented groups, in particular, is substantially increased. These results demonstrate how AI can mitigate ancestral bias in training data and contribute to equitable representation in medical research.


Metastasis prediction in thyroid cancer (THCA)
Metastases have a substantial impact on patient prognosis. In this analysis, we predicted whether patients would undergo metastasis (M0 versus MX). We trained 37 models on data from 436 individuals, divided into training batches of 14-18 samples each. The number of training sets per ancestry corresponds to the proportion of samples in the TCGA THCA database, resulting in 23 EUR, 1 AFR, 3 EAS, 1 ADMIXED and 9 MIXED models. Both PhyloFrame and the benchmark use the HumanBase thyroid gland network for network construction. PhyloFrame uses population data from gnomAD.
For each model (PhyloFrame or benchmark) we predicted metastasis status for entire populations and calculated AUC. If a model was trained on the same population being tested (e.g., an EUR-trained model being tested in EUR), we excluded the samples used for training from the larger set. We plotted AUC results for the benchmark compared to PhyloFrame (Supplementary Fig. 2F-J).
[...] resulting in 12 EUR, 2 AFR, 1 EAS, 1 ADMIXED and 6 MIXED models. Both PhyloFrame and the benchmark use the HumanBase uterine endometrium network for network construction. PhyloFrame uses population data from gnomAD.
For each model (PhyloFrame or benchmark) we predicted subtypes for entire populations and calculated AUC. If a model was trained on the same population being tested (e.g., an EUR-trained model being tested in EUR), we excluded the samples used for training from the larger set. We plotted AUC results for the benchmark compared to PhyloFrame (Supplementary Fig. 2K-O).
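The evaluation step described above can be illustrated with a minimal sketch. It assumes each model emits one score per sample and that AUC is computed in the standard rank-based way; the function names, sample identifiers, and scores are hypothetical and are not taken from the PhyloFrame code base.

```python
def auc(scores, labels):
    """Rank-based AUC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def population_auc(predictions, training_ids):
    """AUC over one population, skipping samples the model was trained on.

    predictions: dict mapping sample_id -> (model score, true label in {0, 1}).
    training_ids: set of sample ids used to train this model.
    """
    held_out = [v for sid, v in predictions.items() if sid not in training_ids]
    scores = [score for score, _ in held_out]
    labels = [label for _, label in held_out]
    return auc(scores, labels)


# Hypothetical EUR samples scored by a hypothetical EUR-trained model.
eur_predictions = {
    "s1": (0.90, 1), "s2": (0.80, 1), "s3": (0.40, 0),
    "s4": (0.75, 0), "s5": (0.70, 1), "s6": (0.20, 0),
}
eur_training_ids = {"s1", "s6"}  # excluded because the model saw them in training
eur_auc = population_auc(eur_predictions, eur_training_ids)
```

When the model is tested on a population it was not trained on, `training_ids` would simply be empty and every sample in that population contributes to the AUC.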

Model stability and consistency
Precision medicine models should identify biologically relevant signatures of disease. A signature may be accurate on a set of training data, but it has limited utility if it does not generalize to other data or identify the biological drivers of disease. To quantify the amount of biological overlap between signatures, we calculate several factors. First, we identify how many known cancer-related genes appear in each signature, and which cancers they have previously been associated with. This is done using COSMIC cancer genes (see Fig. 3H,I). We count the COSMIC genes identified by each model and use a t-test to compare these counts between PhyloFrame and benchmark models. Second, we calculate the overlap in disease signatures for each model. Higher signature overlap indicates that, despite different training data, the models identify the same factors as driving the disease. To compare this overlap quantitatively, we calculate pairwise model signature correlations. Signature correlations are calculated based on the presence or absence of each gene in a model signature; we do not consider model weights for the genes. This results in a signature-by-signature matrix filled with pairwise signature-signature correlations. PhyloFrame signatures have significantly higher overlap than benchmark models (mean 47% vs 2% overlap, Fig. 3G). We then ran a t-test comparing all PhyloFrame-PhyloFrame pairwise model correlations against all benchmark-benchmark pairwise model correlations, and found a statistically significantly higher likelihood of signature overlap in PhyloFrame than in benchmark models.
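The pairwise comparison can be sketched as follows. This is an illustration under stated assumptions: the text does not specify the correlation coefficient or the gene universe, so the sketch assumes Pearson correlation on 0/1 presence vectors (the phi coefficient) over the union of genes seen in any signature. The model names and toy signatures are hypothetical.

```python
from itertools import combinations
from math import sqrt


def presence_vector(signature, gene_universe):
    """1 if the gene is in the signature, else 0; model weights are ignored."""
    return [1 if gene in signature else 0 for gene in gene_universe]


def pearson(x, y):
    """Pearson correlation; returns NaN if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


def pairwise_signature_correlations(signatures):
    """Map each unordered pair of model names to its signature correlation."""
    # Assumption: the gene universe is the union of all signature genes.
    universe = sorted(set().union(*signatures.values()))
    vectors = {name: presence_vector(sig, universe)
               for name, sig in signatures.items()}
    return {(a, b): pearson(vectors[a], vectors[b])
            for a, b in combinations(sorted(vectors), 2)}


# Toy signatures for three hypothetical models.
toy_signatures = {
    "model_a": {"FOXA1", "ESR1", "GATA3"},
    "model_b": {"FOXA1", "ESR1", "TP53"},
    "model_c": {"KRT5", "TP53"},
}
toy_correlations = pairwise_signature_correlations(toy_signatures)
```

The resulting values for all PhyloFrame-PhyloFrame pairs and all benchmark-benchmark pairs would then form the two groups compared by the t-test.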

Sample-specific model performance
Some samples are far more difficult to predict than others, and while overall model AUC is important, it is also valuable to see how each model performs in the more difficult sample sets. To see this, we plot the per-sample differences in model performance (% of models that correctly predict each sample; Supplementary Fig. 5), and identify which samples are often misclassified, if any. Each point in these plots represents one sample. In the boxplots, samples are grouped by ancestry and by model type (PhyloFrame vs benchmark). The y-axis shows the percent of models that correctly predict each sample. Both PhyloFrame and the benchmark models struggle to correctly predict a small subset of samples. For most of the BRCA samples, all models correctly predict BRCA subtype. A small subset of samples is incorrectly predicted by all or most of the BRCA models (see Fig. 4 for a critical factor explaining this effect). UCEC and THCA models have far more variability, as expected, given that the models have lower average AUC than the BRCA models. Metastasis is a harder prediction task than tumor subtype, and so many of the THCA models have low performance. Of note, this per-sample performance is not shared across models; there is a wide range of success for each set of sample predictions. This suggests that there are factors relevant to metastasis in THCA that are not being identified by the models. It is unclear if this is a tractable prediction task, given the small training data size. UCEC models similarly have varied performance for each sample; however, most samples have a high percentage of prediction success across models. The endometrial versus serous subtype UCEC model predictions for a subset of samples are unreliable. This prediction task is difficult due to the low number of serous samples available in the training data. Serous samples are only approximately 25% of the samples, and half of those samples are from individuals of European descent. For example, an East Asian model could not be trained in uterine cancer because there were only 3 serous samples. The samples most often misclassified in this prediction task are of the serous subtype.
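The per-sample metric plotted on the y-axis can be computed along the following lines. This is a minimal sketch; the data layout, the 50% cutoff used to flag frequently misclassified samples, and all sample and model names are hypothetical.

```python
def percent_models_correct(predictions, true_labels):
    """Percent of models that correctly predict each sample.

    predictions: dict mapping model name -> dict of sample_id -> predicted label.
    true_labels: dict mapping sample_id -> true label.
    """
    result = {}
    for sample_id, truth in true_labels.items():
        votes = [preds[sample_id] for preds in predictions.values()
                 if sample_id in preds]
        if votes:  # skip samples that no model was asked to predict
            result[sample_id] = 100.0 * sum(v == truth for v in votes) / len(votes)
    return result


def often_misclassified(percent_correct, threshold=50.0):
    """Samples that fewer than `threshold` percent of models get right."""
    return sorted(s for s, pct in percent_correct.items() if pct < threshold)


# Hypothetical UCEC-style example with four models and three samples.
toy_truth = {"u1": "endometrial", "u2": "serous", "u3": "serous"}
toy_predictions = {
    "m1": {"u1": "endometrial", "u2": "endometrial", "u3": "serous"},
    "m2": {"u1": "endometrial", "u2": "endometrial", "u3": "serous"},
    "m3": {"u1": "endometrial", "u2": "serous", "u3": "serous"},
    "m4": {"u1": "endometrial", "u2": "endometrial", "u3": "endometrial"},
}
toy_percent = percent_models_correct(toy_predictions, toy_truth)
toy_hard_samples = often_misclassified(toy_percent)
```

Grouping the resulting percentages by ancestry and by model type gives the values that would populate each boxplot.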

COSMIC gene enrichment and presence
COSMIC currently includes 736 genes that have been expert-curated and validated as cancer-related based on previous studies. We used COSMIC in two sets of analyses. First, we used COSMIC genes to identify the EAF trends of disease-related genes, to assess the effectiveness of using EAF as an equity adjustment in AI methods. We found that there is no EAF enrichment in COSMIC versus non-COSMIC genes (Supplementary Fig. 4; t-test, p-value = 1), suggesting that the utility of EAFs in equitable AI is not limited to cancer studies. Second, we used COSMIC genes to determine the extent to which model signatures in this paper recapitulate known cancer processes, and as a validation set of genes that ideally will be enriched in the disease signatures. As most of the models in this paper are trained on smaller sample sizes (due to ancestry bias in the data), we expect model overfitting. Identifying the number and variety of COSMIC cancer genes in each signature helps to determine how much of each signature is cancer-related. COSMIC genes with high EAFs are more frequently enriched in African and East Asian but not European ancestries (Supplementary Fig. 4B; t-test, p-value < 2.2e-16). For example, FOXA1 is one of the five COSMIC genes most frequently identified by the benchmark BRCA models (Fig. 3H,I). While it is not one of the COSMIC genes most frequently identified by PhyloFrame models, more PhyloFrame (89%) than benchmark models (81%) include FOXA1 in their signatures (24 vs 22 of a total of 27 models).
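The counting behind these comparisons is simple enough to sketch. The helper names and the toy signatures are hypothetical; only the model counts (24 and 22 of 27) are taken from the percentages reported above.

```python
def cosmic_hits(signature, cosmic_genes):
    """COSMIC genes that appear in one model signature."""
    return sorted(set(signature) & set(cosmic_genes))


def fraction_of_models_with_gene(signatures, gene):
    """Fraction of model signatures that include the given gene."""
    if not signatures:
        raise ValueError("need at least one signature")
    return sum(gene in sig for sig in signatures) / len(signatures)


# Toy COSMIC subset and one toy signature (illustrative gene lists only).
toy_cosmic = {"FOXA1", "TP53", "GATA3"}
toy_signature = {"FOXA1", "GATA3", "GENE_X"}  # GENE_X is a placeholder name
hits = cosmic_hits(toy_signature, toy_cosmic)

# 27 toy signatures per method: 24 and 22 of them contain FOXA1, respectively.
phyloframe_like = [{"FOXA1"}] * 24 + [{"GENE_X"}] * 3
benchmark_like = [{"FOXA1"}] * 22 + [{"GENE_X"}] * 5
phyloframe_pct = round(100 * fraction_of_models_with_gene(phyloframe_like, "FOXA1"))
benchmark_pct = round(100 * fraction_of_models_with_gene(benchmark_like, "FOXA1"))
```

With 27 models per method, 24 signatures containing FOXA1 correspond to 89% and 22 correspond to 81%, which matches the percentages quoted for PhyloFrame and the benchmark.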

The impact of admixture
Continental and ethnic classifiers, including those used to group samples in this study, are flawed proxies for ancestral diversity. In an increasingly interconnected world, rates of admixed ancestry are likely to increase. Even in the present day, the extent and impact of admixture within human populations remain under-recognized, especially as they pertain to underrepresented groups in the United States. We sought to understand how admixture impacts the predictive power of PhyloFrame relative to the benchmark.
To explore the impact of admixture, we examined the predictive efficacy of models trained on EUR breast cancer data. We selected breast cancer because the overall high AUC across all models allows us to largely exclude noise generated by poor model performance, and instead identify which individuals the models struggle to accurately predict. We used EUR training data because having more trained models (in this case, 54 models) enables us to precisely determine whether an individual is being stochastically or systematically incorrectly predicted. The current dramatic overrepresentation of Europeans in genomic databases [9, 34] provides additional value for this approach, as it more closely mirrors expected real-world studies across disease types.
We calculated the fraction of models that correctly predicted an individual's subtype and plotted those fractions relative to the individual's admixed ancestry proportion. Note that this includes individuals otherwise classified as Admixed (greater than 20% non-majority ancestry), who are grouped here according to their majority ancestry. Next, we measured the statistical variance of model prediction accuracy across populations (R geom_smooth with a LOESS model). Individuals with majority European ancestry show stable prediction across admixture levels (Fig. 4B), whereas PhyloFrame improves predictions for individuals from many ancestries and models, most notably for the BRCA EUR models applied to Admixed and AFR samples (Supplementary Fig. 5). Majority African ancestry individuals show both significant increases in model performance with increasing admixed ancestry and significantly better performance in PhyloFrame than in the benchmark. Given that the vast majority of admixed ancestry in African Americans is European, including in the individuals in this study (Supplementary Table 2), this highlights a shortcoming of current predictive methods that is easily overlooked when grouping individuals by continental-level ancestry or when relying on overall model AUC and other performance metrics. Models trained on overwhelmingly European data sets may appear to perform acceptably in African ancestry individuals, when in fact performance is not uniformly high across the group. This raises substantial concerns about existing models' abilities to provide insightful and accurate precision medicine predictions for individuals of un-admixed African ancestry.
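The grouping rule used in this analysis can be made concrete with a short sketch. It assumes ancestry is available as per-individual proportions; the function name, the data layout, and the example proportions are hypothetical, while the 20% cutoff is the one stated above.

```python
def classify_ancestry(proportions, admixed_threshold=0.20):
    """Return (majority ancestry, non-majority proportion, admixed flag).

    proportions: dict mapping ancestry label -> estimated ancestry proportion.
    An individual is flagged as admixed when the non-majority ancestry
    exceeds `admixed_threshold`, but is still grouped by majority ancestry.
    """
    majority = max(proportions, key=proportions.get)
    non_majority = sum(p for label, p in proportions.items() if label != majority)
    return majority, non_majority, non_majority > admixed_threshold


# Hypothetical individuals (proportions are illustrative, not study data).
individual_a = {"AFR": 0.70, "EUR": 0.28, "EAS": 0.02}
individual_b = {"EUR": 0.95, "AFR": 0.05}

group_a, admixture_a, is_admixed_a = classify_ancestry(individual_a)
group_b, admixture_b, is_admixed_b = classify_ancestry(individual_b)
```

Pairing each individual's non-majority proportion with the fraction of models that predicted them correctly yields the points that the smoothed curves summarize.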

External validation of BRCA models
To externally validate our model, we chose a dataset that was both outside of the training set used in the development of PhyloFrame and that provided an opportunity to assess the performance of PhyloFrame and the benchmark on ancestry groups not present in the training data. To meet these objectives, we analyzed triple negative breast cancer (TNBC) data from Martini et al [30], comprising 9 African Americans, 6 Ghanaians and 11 Ethiopians, for a total of 26 TNBC patients (Fig. 5A).

Data and preprocessing
As with the previous analyses of breast cancer, PhyloFrame and the benchmark were tasked with classifying the samples as either basal or luminal. We applied the PhyloFrame and benchmark models trained using the same subdivisions of the TCGA BRCA data described above, resulting in 27 models (17 EUR, 2 AFR, 1 EAS, 1 ADMIXED and 6 MIXED). These trained models were applied to an external validation set, the Martini et al [30] TNBC data. Most basal breast cancers are also TNBCs, and the terms are often used interchangeably. Thus, successful models should predict all of the validation set samples to be basal. Because every sample is expected to be basal, accuracy functionally reduces to recall: the proportion of the samples in each population that the models correctly identify as basal.
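Because every validation sample is expected to be basal, the metric reduces to a single proportion per population. A minimal sketch follows; the function name and the toy predictions are hypothetical, while the population sizes (9, 6 and 11) are those of the validation set described above.

```python
def recall_basal(predicted_labels, positive="basal"):
    """Proportion of samples predicted as basal.

    Valid as recall (and as accuracy) only when every sample's true
    label is basal, as assumed for this TNBC validation set.
    """
    if not predicted_labels:
        raise ValueError("need at least one prediction")
    return sum(p == positive for p in predicted_labels) / len(predicted_labels)


# Toy predictions from one hypothetical model, split by population.
toy_model_predictions = {
    "African American": ["basal"] * 4 + ["luminal"] * 5,  # 9 samples
    "Ghanaian": ["basal"] * 4 + ["luminal"] * 2,          # 6 samples
    "Ethiopian": ["basal"] * 9 + ["luminal"] * 2,         # 11 samples
}
toy_recall = {population: recall_basal(labels)
              for population, labels in toy_model_predictions.items()}
```

Averaging these per-model values over all PhyloFrame models, and separately over all benchmark models, would give a mean recall per population.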

Model results
In African populations (Ghanaian and Ethiopian), all but one PhyloFrame model has performance greater than random chance accuracy (>50%). Mean PhyloFrame performance is higher than that of the benchmark in both the Ghanaian validation set samples (mean recall PhyloFrame = 0.64 vs benchmark = 0.62) and the Ethiopian validation set samples (mean recall PhyloFrame = 0.78 vs benchmark = 0.70). Performance is more similar in the African American validation set samples (mean recall PhyloFrame = 0.42 vs benchmark = 0.47).
Martini et al samples from the US were collected from New York City (New York), Detroit (Michigan), Ann Arbor (Michigan), and Birmingham (Alabama). TCGA samples were also recruited from the US, and TCGA BRCA samples came from 42 tissue source sites, including several in New York City. Given that the training data include samples from overlapping populations, we expected the benchmark model to perform well in the African American validation set samples; however, this is not the case (median = 0.56). Both the benchmark and PhyloFrame models have highly variable performance in the validation set African American samples, suggesting that neither set of models is able to fully disentangle the complexities of ancestry and breast cancer subtypes. Given that the Basal subtype is enriched in African Americans [30,42,54], this prediction task may be intrinsically connected to ancestry; it has been previously suggested that the Basal/Luminal subtypes are unintentionally linked to African ancestry. However, there are distinctions between the two datasets, even within African American samples, that may explain some of the variability. TCGA BRCA samples were diagnosed with BRCA from 1988-2003 [...]
Supplementary Figure 2 Equitable AI effectiveness. AUC of the benchmark versus PhyloFrame models when training in (A-E) BRCA, (F-J) THCA, and (K-O) UCEC using different populations for the training and validation data and varying the ancestral population of the training data. Rows correspond to training data used (ADMIXED, AFR, EAS, EUR, MIXED). MIXED indicates that the training data ancestry diversity matches that of the TCGA data; it is not representative of the global population distributions. Each dot corresponds to a combination of training and test data, color coded by test data.
Supplementary Figure 4 Transcriptome-wide EAF enrichment. EAF density plots for (A) all genes and (B) COSMIC cancer genes, grouped by ancestry. A shows all 17 gnomAD ancestries and B shows the EUR, EAS, and AFR ancestries used to group the ancestry-specific training data sets for the AI models. Peaks demonstrate unique EAF across ancestries.
Supplementary Figure 5 Sample-specific model performance. To ascertain which, if any, samples are harder to predict, we calculated the performance of all models for each sample. Boxplots show the percent of models that correctly subtype each sample in (A,D,G,J,M) BRCA, (B,E,H,K,N) THCA, (C,F,I,L,O) UCEC. Samples are grouped by genetic ancestry and model type (PhyloFrame or benchmark). The EAS UCEC plot is greyed out as there are not enough samples to train the models. In each plot, each dot is a single sample and the y-axis shows the percent of models that correctly classify that sample.
Supplementary Figure 6 Effect of admixture on EUR-trained model performance. A comparison of models trained on EUR BRCA data and the percent of correctly predicted held-out EAS BRCA samples as admixture levels increase in PhyloFrame (dark green) and benchmark (light green) models.